Corpus-Based Grammar Specialization

Authors

Nicola Cancedda

Christer Samuelsson

Abstract

Broad-coverage grammars tend to be highly ambiguous. When such grammars are used in a restricted domain, it may be desirable to specialize them, in effect trading some coverage for a reduction in ambiguity. Grammar specialization is here given a novel formulation as an optimization problem, in which the search is guided by a global measure combining coverage, ambiguity and grammar size. The method, applicable to any unification grammar with a phrase-structure backbone, is shown to be effective in specializing a broad-coverage LFG for French.

1 Introduction

Expressive grammar formalisms allow grammar developers to capture complex linguistic generalizations concisely and elegantly, thus greatly facilitating grammar development and maintenance. Broad-coverage grammars, however, tend to overgenerate considerably, thus allowing large amounts of spurious ambiguity. If the benefits resulting from more concise grammatical descriptions are to outweigh the costs of spurious ambiguity, the latter must be brought down. We here investigate a corpus-based compilation technique that reduces overgeneration and spurious ambiguity without jeopardizing coverage or burdening the grammar developer.

The current work extends previous work on corpus-based grammar specialization, which applies variants of explanation-based learning (EBL) to grammars of natural languages. The earliest work (Rayner, 1988; Samuelsson and Rayner, 1991) builds a specialized grammar by chunking together grammar rule combinations while parsing training examples. Which rules to combine is specified by hand-coded criteria. Subsequent work (Rayner and Carter, 1996; Samuelsson, 1994) views the problem as that of cutting up each tree in a treebank of correct parse trees into subtrees, after which the rule combinations corresponding to the subtrees determine the rules of the specialized grammar.
Using the SRI Core Language Engine (Alshawi, 1992) in the ATIS domain, this approach is reported to yield more than a 3-fold speedup at a cost of 5% in grammatical coverage, a loss that is compensated by an increase in parsing accuracy. Later work (Samuelsson, 1994; Sima'an, 1999) attempts to automatically determine appropriate tree-cutting criteria, the former using local measures, the latter using global ones.

The current work reverts to the view of EBL as chunking grammar rules. It extends the latter work by formulating grammar specialization as a global optimization problem over the space of all possible specialized grammars, with an objective function based on the coverage, ambiguity and size of the resulting grammar. The method was evaluated on the LFG grammar for French developed within the PARGRAM project (Butt et al., 1999), but it is applicable to any unification grammar with a phrase-structure backbone, provided that the reference treebank contains all possible analyses for each training example, along with an indication of which one is correct.

To explore the space of possible grammars, a special treebank representation was developed, called a folded treebank, which allows the objective function to be computed very efficiently for each candidate grammar. This representation relies on two facts: all possible parses returned by the original grammar for each training sentence are available, and grammar specialization never introduces new parses; it only removes existing ones.

The rest of this paper is organized as follows. Section 2 describes the initial candidate grammar and the operators used to generate new candidate grammars from any given one. The function to be maximized is introduced and motivated in Section 3. The folded treebank representation is described in Section 4, while Section 5 presents the experimental results.
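The observation that specialization only removes parses can be illustrated with a small sketch. It is not the folded treebank representation or the objective function of the paper (neither is specified in this excerpt). It only assumes that each parse is recorded as the set of rule names it uses, that coverage is the fraction of sentences whose correct parse survives, and that ambiguity is the average number of surviving parses per sentence. The names `Sentence`, `surviving_parses` and `coverage_and_ambiguity` are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Sentence(NamedTuple):
    # Every parse the original grammar produced, each stored as the set of
    # rule names it uses (an assumed, simplified encoding).
    parses: List[FrozenSet[str]]
    # Index into `parses` of the parse marked as correct.
    correct: int


def surviving_parses(sentence: Sentence, grammar: FrozenSet[str]) -> List[int]:
    """Indices of parses that use only rules still present in `grammar`."""
    return [i for i, used in enumerate(sentence.parses) if used <= grammar]


def coverage_and_ambiguity(
    treebank: List[Sentence], grammar: FrozenSet[str]
) -> Tuple[float, float]:
    """Assumed definitions: coverage = share of sentences whose correct parse
    survives; ambiguity = mean number of surviving parses per sentence."""
    covered = 0
    total_parses = 0
    for sentence in treebank:
        alive = surviving_parses(sentence, grammar)
        if sentence.correct in alive:
            covered += 1
        total_parses += len(alive)
    n = len(treebank)
    return covered / n, total_parses / n


# Toy treebank with made-up rule names r1..r5.
treebank = [
    Sentence([frozenset({"r1", "r2"}), frozenset({"r1", "r3"})], correct=0),
    Sentence(
        [frozenset({"r2", "r4"}), frozenset({"r3", "r4"}), frozenset({"r4", "r5"})],
        correct=0,
    ),
]
full = frozenset({"r1", "r2", "r3", "r4", "r5"})

print(coverage_and_ambiguity(treebank, full))           # (1.0, 2.5)
print(coverage_and_ambiguity(treebank, full - {"r3"}))  # (1.0, 1.5)
```

In this toy case, dropping `r3` removes two incorrect parses without losing any correct one, so ambiguity falls while coverage stays the same. Because candidate grammars can only filter the stored parses, no reparsing is needed to evaluate them.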
2 Unfolding and Specialization

The initial grammar is the grammar underlying the subset of correct parses in the training set. This is in itself a specialization of the grammar which was used to parse the treebank, since some rules may not show up in any correct parse in the training set; experimental results for this first-order specialization are reported in (Cancedda and Samuelsson, 2000). This grammar is further specialized by inhibiting rule combinations that show up in incorrect parses much more often than in correct parses.

In more detail, we considered downward unfolding of grammar rules (see Fig. 1). A grammar rule is unfolded downwards on one of the symbols in its right-hand side if it is replaced by a set of rules, each corresponding to the expansion of the chosen symbol by means of another grammar rule.

More formally, let $G = (\Sigma, \Sigma_N, S, R)$ be a context-free grammar, and let $r, r' \in R$, $k \in \mathbb{N}^+$ such that $\mathrm{rhs}(r) = \alpha A \beta$, $|\alpha| = k - 1$, $\mathrm{lhs}(r') = A$, $\mathrm{rhs}(r') = \gamma$. The rule adjunction of $r'$ in the $k$-th position of $r$ is defined as a new rule $RA(r, k, r') = r''$, such that:

$\mathrm{lhs}(r'') = \mathrm{lhs}(r)$
$\mathrm{rhs}(r'') = \alpha \gamma \beta$

For unification grammars, we instead require

$\mathrm{lhs}(r') \sqcup \mathrm{rhs}(r)(k)$
$\mathrm{lhs}(r'') = \theta(\mathrm{lhs}(r))$
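For the context-free case, rule adjunction and downward unfolding as defined above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Rule` class, the function names and the toy grammar are assumptions made for the example, and the unification case is not covered.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    lhs: str
    rhs: Tuple[str, ...]


def rule_adjunction(r: Rule, k: int, r_prime: Rule) -> Rule:
    """RA(r, k, r'): replace the k-th (1-based) symbol of rhs(r) by rhs(r')."""
    if not 1 <= k <= len(r.rhs):
        raise ValueError("k must index a symbol in rhs(r)")
    if r.rhs[k - 1] != r_prime.lhs:
        raise ValueError("lhs(r') must equal the k-th symbol of rhs(r)")
    alpha, beta = r.rhs[: k - 1], r.rhs[k:]  # |alpha| = k - 1
    return Rule(r.lhs, alpha + r_prime.rhs + beta)


def unfold_downwards(rules: List[Rule], r: Rule, k: int) -> List[Rule]:
    """Replace r by one adjoined rule per rule that expands its k-th symbol."""
    symbol = r.rhs[k - 1]
    expansions = [rule_adjunction(r, k, rp) for rp in rules if rp.lhs == symbol]
    return [x for x in rules if x != r] + expansions


# Toy grammar (hypothetical): S -> NP VP, VP -> V NP, VP -> V
s = Rule("S", ("NP", "VP"))
vp_trans = Rule("VP", ("V", "NP"))
vp_intrans = Rule("VP", ("V",))
grammar = [s, vp_trans, vp_intrans]

unfolded = unfold_downwards(grammar, s, 2)
for rule in unfolded:
    print(rule.lhs, "->", " ".join(rule.rhs))
```

Unfolding `S -> NP VP` on its second symbol replaces it with `S -> NP V NP` and `S -> NP V`. Each of the new rules corresponds to one rule combination, so a later specialization step can drop an individual combination without removing either of the original rules from the rest of the grammar.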

Related resources

Japanese Speech Understanding using Grammar Specialization

The most common speech understanding architecture for spoken dialogue systems is a combination of speech recognition based on a class N-gram language model, and robust parsing. For many types of applications, however, grammar-based recognition can offer concrete advantages. Training a good class N-gram language model requires substantial quantities of corpus data, which is generally not availab...

Grammar Specialization through Entropy Thresholds

Explanation-based generalization is used to extract a specialized grammar from the original one using a training corpus of parse trees. This allows very much faster parsing and gives a lower error rate, at the price of a small loss in coverage. Previously, it has been necessary to specify the tree-cutting criteria (or operationality criteria) manually; here they are derived automatically from t...

The Impact of Teaching Corpus-based Collocation on EFL Learners' Writing Ability

Abstract The present study explores the impact of corpus-based collocation instruction on intermediate Iranian EFL learners' writing ability. For this study, 84 Iranian learners, studying English as a foreign language in Bayan Institute, Iran, were selected and were randomly divided into two groups, experimental and control. Conventional methods of writing instruction were taught to the control...

Fast Parsing Using Pruning and Grammar Specialization

We show how a general grammar may be automatically adapted for fast parsing of utterances from a specific domain by means of constituent pruning and grammar specialization based on explanation-based learning. These methods together give an order of magnitude increase in speed, and the coverage loss entailed by grammar specialization is reduced to approximately half that reported in previous wor...

Publication date: 2000
